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Abstract 

We are interested in testing multiple hypotheses using tests that can only be evaluated by 
simulation such as permutation tests or bootstrap tests. This article introduces a sequential 
algorithm which modifies standard procedures which work with exact p-values, such as the 
Benjamini & Hochberg False Discover Rate (FDR) procedure, and controls the number of 
samples being drawn for each hypothesis. We show that, with arbitrary high probability, 
the algorithm gives the same classification as the original procedure with the exact p-values. 
The method is not only applicable to the original FDR procedure but also extends to a wider- 
class of FDR controlling procedures and to controlling the Familywise Error Rate using the 
Bonferroni correction. At any stage, the algorithm can be interrupted and returns a set of 
hypotheses which can already be classified with satisfactory precision and a set of hypotheses 
whose decision is still pending. 

Keywords: Bootstrap/resampling, Computationally Intensive Methods, Multiple Compar- 
isons, False Discovery Rate, Sequential Algorithm 



1 Introduction 



Consider multiple hypotheses to be tested for statistical significance using a procedure which 



corrects for the multiplicity, such as the method by Benjamini and Hochberg (1995) or the 



Bonferroni correction (Bonferroni 1936). Standard procedures require exact knowledge of all p- 



values. We consider the case where p-values are not known exactly and can only be computed by 
simulation. For example, this occurs when using bootstrap or permutation tests. Recent studies 



involving such tests include Pekowska et al. (2010) using data from a genome data archive, Lage- 



Castellanos et al. (2010) using brain activity data and Jiao and Zhang (2010) using microarray 



data. 

This article introduces MMCTest, an algorithm to implement multiple Monte-Carlo tests. The 
algorithm is sequential: it starts with all hypotheses being unclassified and then proceeds taking 
samples until all but a certain number of hypotheses have been classified or until a certain effort 
is reached. The proposed algorithm can be stopped earlier while having the same guarantee 
on the probability of misclassifications. When being stopped before all hypotheses have been 
classified, the algorithm returns three sets: the significant, the not significant and the not yet 
classified hypotheses. 

The algorithm gives, with pre-specified probability, the same classification (rejected and 
non-rejected hypotheses) as the classification based on the exact p-values. It aims to use fewer 
samples for hypotheses which can already be classified with sufficient confidence and more sam- 
ples for hypotheses which cannot be classified yet. 



Sandve et al. (2011) suggested an algorithm, called MCFDR, which is related to our approach. 



It is a modification of the sequential method by Besag and Clifford (1991). In contrast to our 
new method, the MCFDR method does not give any guarantees on how the results relate to the 
FDR procedure applied to the exact p-values. 

The basic MMCTest algorithm will be described in Section [2] Moreover, Section [2] states 
conditions which need to be satisfied to guarantee the bound on the probability of classification 
errors and to guarantee the convergence of the algorithm's output of sets of rejected and non- 
rejected hypotheses to the sets computed using the exact p-values. In Section [3] we show that 
the multiple testing procedure by Benjamini and Hochberg (1995) and the Bonferroni correction 
satisfy these conditions. We also discuss the extension of the FDR procedure using estimators 
by Pounds and Cheng ( 2006 ) . The dependence of the computational effort of the proposed algo- 
rithm on certain parameters is assessed empirically in Section [4j Furthermore, its performance 
is compared to the MCFDR algorithm. Section [5] contains a concluding discussion. 

All proofs can be found in the Appendix. The MMCTest algorithm is implemented in an 
R-package (simctest, available on CRAN, The Comprehensive R Archive Network). 

In this article, | • | denotes the number of elements in a finite set and the length of an 
interval. Moreover, ||-|| denotes the Euclidian norm of a vector and A c denotes the complement 
of A C {1, ... , m} with respect to {1, ... , m}, where m denotes the number of hypotheses under 
consideration. 



2 Description of the algorithm 

2.1 Basic algorithm 

Consider testing m hypotheses Hqi, ■ ■ ■ , ffo™ having corresponding test statistics Ti, . . . , T m and 
observed values t\,...,t m . A large value of ti shall indicate evidence against Hq{. Moreover, let 
p* denote the exact p- value belonging to the potentially estimated hypothesis H^i. We assume 
that p*, . . . ,p* m are not available analytically, but have to be obtained through simulations. 

We assume that for every potentially estimated hypothesis , i E { 1 , . . . , m} , we can 
obtain samples from the test statistic Tj under the null hypothesis. Thus, by considering if 
these realizations exceed the observed value ti, we can generate independent random variables 
Xij ~ Bernoulli^*), j G N. 

Suppose that h : [0, l] m — > V({1, ■ ■ ■ ,m}) takes a vector of p-values and returns the set of 
indices of hypotheses to be rejected, where V denotes the power set. We will call any such 
function a multiple testing procedure. Ultimately, we are interested in obtaining A* := h(p*), 
which we refer to as the true set of rejections. 



For example, h can be the FDR controlling procedure by Benjamini and Hochberg (1995) 
with threshold a > 0, which works as follows. Given m p-values pi, ■ ■ ■ ,p m , their order statistic 
is denoted by p^ < £>( 2 ) < • • • < P( m )- I n case of a tie, equal values are assigned a rank in 
arbitrary order. Let k be the largest index i for which p^ < ~a. Then, rejecting all hypotheses 
corresponding to P(i), ■ ■ ■ ,P(k) controls the False Discovery Rate at threshold a. The procedure 
can also be expressed as 

Hp) = \ i e{l,...,m} : 3j : r p (j) > r p (i) and m^— < a 

where r p (i) denotes the rank of pi in the sorted sequence of p-values. 
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We call a multiple testing procedure h monotonic if h(p) I) h(q) Vp < q, where p,q G [0, l] m , 
i.e. if lower p-values lead to more rejections. We show in Theorem [5] in Section |3.1| that the 
Benjamini-Hochberg procedure is monotonic. 

The following generic algorithm is designed for monotonic multiple testing procedures. It 
iteratively controls the set of hypotheses for which further samples need to be drawn by refining 
confidence intervals for every p* through Monte-Carlo sampling. At iteration n, the confidence 
interval for the p- value p* is denoted by If. For simplicity, we work with closed confidence 
intervals. 

Algorithm 1. [MMCTest] 

1. n:=0, A :=0, A :={l,...,m}. 

2. While (\A n \A n \ > c and total number of samples is below an upper limit) 

(a) n := n + 1. 

(b) Compute confidence intervals If for all i S {1, . . . ,m} by calling a subalgorithm 
ImproveIntervals(A n \ A n ). 

(c) Set A n := h((max If)i=i,..., m ), K, ■= h((minlf )i=i,..., m ). 

3. Return (A n , A n ). 

Remark 1. 1. The subalgorithm Improvelntervals takes a set as argument and computes 
refined confidence intervals for all indices in this set. A particular implementation is 
presented in Section \2.2\ which ensures that the probability of no false classifications is at 
least 1 — e, where e £ (0, 1) is a constant to be chosen by the user. 

2. The algorithm runs until at most c > hypotheses are classified or until the total number 
of samples drawn reaches a pre-specified upper limit. 



Example 1. An example run of MMCTest (with c = 0) is visualized in Figure^ It uses m = 10 
hypotheses and the Benjamini-Hochberg FDR controlling procedure with threshold a = 0.4. The 
confidence intervals have been computed as described in Section 2.2. Columns show different 
iterations, the upper row shows the computation of A n , the lower row shows the computation 
°f A n - Only the lower (upper) end of the confidence interval matters for the computation of 
A n (A n ) thus the hypotheses are ordered by their lower (upper) confidence limit in the upper 
(lower) row. In this example this turns out to be the same ordering. In the second iteration (left 
column), MMCTest has already classified the last hypothesis as being non-rejected as the lower 
confidence limit of its p-value lies above the line connecting the points (0, 0) and (m, a) which 
we call the Benjamini-Hochberg line. All other hypotheses are still undecided and thus their 
confidence intervals will be refined. After a few additional iterations (middle column), the seven 
smallest values can be classified as rejected as the upper confidence limit of the seventh value 
is below the line. Likewise, the confidence interval of the ninth value has now been shrunk to 
be entirely above the line which classifies this value as non-rejected. The eighth p-value is still 
unclassified as its confidence interval overlaps with the line. After refining the confidence interval 
further, the algorithm stops in the situation depicted in the right column: the two classifications 
in A n and A n have converged. 
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Figure 1: Visualization of MMCTest after the second iteration (left), after a few additional 
iterations (center) and after the last iteration (right); in the upper row, the Benjamini-Hochberg 
procedure h is applied to lower confidence limits, where the set A n of rejected hypotheses is 
visualized with bold confidence intervals; the lower row shows the rejections A n when applying 
h to upper confidence limits. 
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Under conditions, the monotonicity of h implies immediately that the sequence of sets A n 
is increasing, that the sequence of sets A n is decreasing and that each A n (A n ) is a subset 
(superset) of the true set of rejections A*. 

Lemma 1. Assume that h is monotonic. 

1. If all confidence intervals If are nested, i.e. iflf +1 C If Vi,n ; then Q4 n )neN /* and 

2. Ifp* G If Vi, n, then A n Q A* <Z A n Vra G N. 
2.2 Computation of confidence intervals 

In this section we state conditions on the algorithm Improvelntervals and on the multiple 
testing procedure h which guarantee that the classification by MMCTest is correct with high 
probability and that all hypotheses will be classified. After that, we present one specific imple- 
mentation of algorithm Improvelntervals which satisfies these conditions. 

The main theorem stated in this section relies on two properties of the multiple testing 
procedure h, collected in the following condition. 

Condition 1. 1. h is monotonic. 

2. Let p,q G [0, l] m . If qi < pi Vi G h{p) and qt > p, L Vi ^ h(p), then h(p) = h(q). 

Besides asking for monotonicity, Condition [T] ensures that lowering the p- value of a rejected 
hypothesis or increasing the p-value of a non-rejected hypothesis does not change the result of 
h. 

Recall that for % G {1, . . . , m}, independent random variables Xy ~ Bernoulli(p*), j G N, 
simulate test statistics exceeding the null hypothesis. The next condition addresses subalgorithm 
Improvelntervals and states that the length of all confidence intervals which are not stopped 
from being improved goes to zero. 

Condition 2. Let B n C {1, . . . ,m}, n G N, and let Xij G {0, 1}, j G N, be any sequence for 
i G {1, . . . , m}. Then \I™\ — ^ as n — > oo for all i G limsup n _ > . 00 B n = f] u>1 \J n > u B n , where 
= ImproveIntervals(B n ). 

The main theorem can now be stated. 

Theorem 2. Suppose Conditions^ and^ hold and suppose that there exists 5 > such that 
p G [0, l] m and \\p — p*\\ < 5 imply h(p) = h(p*). Then, on the event {p* G If Vi,n}, both 
sequences Q4 n ) n eN and {A n ) n ^ converge to A*, i.e. there exists no G N such that A n = A n = A* 
Vn > no- 

The condition on p* in Theorem [2] ensures that p* has a neighborhood on which h is constant. 
The next condition ensures that the confidence intervals computed by subalgorithm 
Improvelntervals have a guaranteed minimal joint coverage probability. 

Condition 3. For a given e > 0, subalgorithm Improvelntervals computes confidence intervals 
If in such a way that G If Vi, n) > 1 — e. 

The main theorem and Condition [3] together immediately give a bound on the probability 
of misclassifications. 
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Corollary 3. Under the conditions of Theorem^ and under Condition^ 



P(3n : A n = A* = A n Vn > n ) > 1 - e, 
i.e. the probability that all classifications are correct is at least 1 — e. 

Next, we give an explicit implementation of the subalgorithm Improvelntervals which will 
be used in the examples of this article. 

Parameters of this algorithm are e > 0, the error probability for any false classification one is 
willing to tolerate and (rj k ) k ^ o , a sequence such that = 770 < rji < . . . and rj k — > 1 — (1 — e) 1//m 
as k — > 00, which is used to control how e is spent over the iterations of the algorithm. We refer 
to (rj k ) as spending sequence. Furthermore, two constants a > 1 and A > 1 control how many 
additional samples are drawn in each iteration. For Example [I] and the examples in Section [4] 
we have used a = 1.25 and A = 10. 

In the algorithm, two vectors S,k £ keep track of counts. They have to be initialized to 
S = k = before running MMCTest. 

Algorithm 2. [Improvelntervals (B )] 

1. A := aA. 

2. For all i G B: 



(a) := + J2 k j=kf+i 

(b) h := ki + A. 



X,, 



(c) I, 



1 1ki- a Si,Si+l^P k i)i 1 



Phi 



0,1- 
' l/ki 



l/ki 
Pki 



1 



< Si < kt, 
Si = 0, 



Si 



ki, 



where p ki = (rj ki -rj ki - A )/2. 
3. If ■- I^ 1 f or am B. 

Remark 2. 1. The confidence intervals computed in step 2.(c) of Algorithm^ are the exact 
"Clopper- Pears on" confidence intervals, see Clopper and Pearson ^934). The quantiles 
q^ e p a {e) of the Beta(a,(3) distribution being used are defined by ¥(Z < q^ a (e)) = e for 

a random variable Z with probability density function p?^wg^ z a ~ l (1 — z)^~ l . Comput- 
ing confidence intervals in this way leads to slightly conservative coverage probabilities in 
practice, see li et al. (2009). 

2. The "Clopper-Pearson" confidence intervals we compute in Algorithm^ are not necessarily 
nested and thus the first result of lemma [I] does not hold. This can easily be fixed by 
intersecting the current confidence interval with the one computed in the previous iteration. 

3. With ki being the number of samples drawn up to iteration n, 1 — rj k . is the minimal joint 
coverage probability of if , ... , If for p* . 

As shown next, Algorithm [2] computes confidence intervals in such a way as to satisfy Con- 
dition [2] and Condition [3j 
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Lemma 4. Algorithm satisfies Conditions and [3| when using the spending sequence ry, = 
(l — (1 — e) 1 /™) for a constant r > 0, where e is i/ie overall error probability, 

Remark 3. When being called through ImproveIntervals(A n \ A n ) in MMCTest, Algorithm^ 
works as follows: Firstly, the number of additional samples A drawn in every step is increased 
geometrically. For all hypotheses which are still considered, i.e. those in A n \A n , an additional 
batch of A samples is drawn. The total number of samples drawn up to iteration n for a hypoth- 
esis i G {1, . . . , m} is stored in ki and the total number of significant test statistics is stored in 
Si . Then, the exact Clopper- Pears on confidence interval If for p* , i G { 1 , . . . , m} , is computed 
using ki and Si. The confidence interval remains unchanged for each hypothesis i ^ A n \ A n 
which is not considered in the current iteration. 



3 Some properties of multiple testing procedures 

In this section we discuss how and under which circumstances several multiple testing procedures, 



namely the procedure by Benjamini and Hochberg (1995), its extension by Pounds and Cheng 
(2006) and the Bonferroni (1936) correction, satisfy the conditions of Theorem pi 



3.1 Properties of the Benjamini-Hochberg procedure 

The next theorem shows three properties of the Benjamini-Hochberg procedure h given in Section 



2.1 which are slightly stronger than Condition [TJ 



Theorem 5. 1. h is monotonic. 

2. Letp,qe [0,l] m . If q { < V* G h(p) and q { = Pi Vi g h(p), then h{p) = h{q). 

3. Let p,q G [0, l] m . // qi = pi Mi G h(p) and qi > %r p (i) \/i £ h(p), then h(p) = h(q). 

The second statement shows that all p- values in the set of rejections can be increased up to 
a certain bound without affecting the result of h. The third statement shows that the result of 
h is not affected if p- values in the non-rejection area are replaced by arbitrary values above the 
Benjamini-Hochberg line (see Example [T]). 

Corollary 6. h satisfies Condition^ 

The next lemma states that h is locally constant for almost all arguments: 

Lemma 7. If p* G [0, l] m with p*,^ ^ ia/m, i G {1, . . . , m}, then there exists 5 > such that 
p G [0, l] m and \\p — p*\\ < 5 imply h(p*) = h(p). 

Lemma [7] thus shows that the condition on p* in Theorem [2] is satisfied for all p- values except 
for those lying exactly on the Benjamini-Hochberg line. 



3.2 Properties of the Bonferroni correction 

The Bonferroni correction controls the Familywise Error Rate, defined by FWER := P(V > 1), 
where V is the number of hypotheses from the null which have been declared significant (false 
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positives). The method by Bonferroni (1936) tests all m hypotheses Hq±, . . . ,Ho m at threshold 



a/m to guarantee FWER < a. The Bonferroni correction h B can be stated as 

hBip) = {i e {1, ■ ■ ■ ,m} : Pi < a/m} . 

The output h B {p) is the set of rejected indices. 

Similarly to Theorem [5j the following theorem states two key properties of h B which are 
slightly stronger than the corresponding statements of Condition [T] 

Theorem 8. 1. h B is monotonia. 

2. Letp,qe [0, l] m . If q { < & Vi G h B (p) and q t > % Vi $ h B (p), then h B (p) = h B (q). 

The second statement of Theorem [8] shows that the result oih B is not affected if p- values in 
the rejection (non-rejection) area are replaced by arbitrary values below (above) the constant 
testing threshold a/m. 

Corollary 9. h B satisfies Condition^ 

Similarly to Lemma [7J the Bonferroni correction is locally constant for almost all values: 

Lemma 10. For all p* G ([0, a/m) U (a/m, l]) m there exists 5 > such that p G [0,l] m and 
||p — < 5 imply h B (p*) = h B {p). 



Lemma 10 thus shows that the condition on p* in Theorem [2] is satisfied for all p- values 
except for those lying exactly on the threshold a/m. 

3.3 Extension to other FDR-controlling procedures 



Benjamini and Hochberg (1995) showed that their procedure actually controls the FDR at 



threshold ^ a < a instead of a, where mo is the unknown number of true null hypotheses. 
A stronger control of the FDR can thus be archived by using the threshold a/no, where the 



proportion of true null hypotheses ttq := mo/m has to be estimated. Pounds and Cheng (2006) 
propose to use the estimator ttq(p) = min(l, ^ YllLiPi)- 

In this section, we explicitly regard the Benjamini-Hochberg procedure as being a function 
of both p- values and threshold a, 



h:[0,l] m x[0,l]^V({l,...,m}) 



{p, a) i-)- { i G {1, . . . , m} : 3j : r p (j) > rJi) and m 



Pj 

r pU) 



< a 



Using this notation, the multiple testing procedure by Pounds and Chengj (2006) can be 
written as hpc(p) = h(p,a/TTo(p)). 

We have already shown in Theorem [5] that h is a decreasing function in p. The following 
lemma shows that h is also an increasing function in a. 

Lemma 11. Let p G [0, l] m . Then < a\ < 02 < 1 implies h(p, a\) C h(p, 02). 

This immediately implies that hpc satisfies the first statement of Condition [Tj 
Corollary 12. hpc is monotonia. 
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However, the second statement of Condition [T] is not satisfied as shown in the following 
remark. 

Remark 4. Consider m = 2 hypotheses. Let a = 2/5 and p = (1/5,3/5). Then ttq(p) = 4/5 
and hpc(p) = h(p, 1/2) = {1}. Let q = (0,3/5). Then q% < pi Vi G hpc{p) = {1} and qi = pi 
Vi ^ hpc(p)- Now tto{q) = 3/5, implying hpc{q) = {1,2} ^ hpc(p)- Thus the second statement 
of Condition^ is not satisfied. 

When using MMCTest and Improvelntervals in conjunction with hpc, the following sce- 
nario can occur. In every iteration, MMCTest applies hpc to upper and lower confidence levels 
(max/™) and (mini™) and thus applies the Benjamini-Hochberg FDR controlling procedure 
at two different thresholds a/-7To((max/f )j=i r .. >m ) and a/7ro((min/™)j = i i ... jm ). Those thresholds 
vary in every iteration as estimates of p- values vary. As more p- values are classified and leave the 
active set B n = A n \ A n , the remaining p- values in B n are less likely to significantly change the 
two thresholds. A single confidence interval can get trapped between the Benjamini-Hochberg 
lines corresponding to the two thresholds, and thus will never be classified. 

An ad-hoc correction to MMCTest is the following procedure. First, algorithm 
Improvelntervals is applied as usual to all active hypotheses in B n . Let N denote the total 
number of new samples drawn during this step. Second, iV/|.B^| new samples are drawn for each 
inactive hypothesis in B^, resulting in the same total number ./V of new samples spent again 
on all inactive hypotheses in B^. New confidence intervals are then computed for all p- values 



p\, ■ ■ ■ ,Pm- This correction is used for the empirical comparison in Section 4.3 As shown in 
the proof of Lemma |4j the length of the Clopper-Pearson confidence intervals computed in the 
Improvelntervals algorithm tends to zero as the sample size tends to infinity. Drawing further 
samples for each hypothesis in every iteration thus implies convergence of the above correction 
to the classification using exact p- values. 

4 Empirical comparison 

This section starts with an empirical assessment of the dependence of the computational effort 
of MMCTest on the number of hypotheses m. After that, we illustrate the convergence of the 



sets A n and A n to A*. A simulation in Section 4.2 suggests that the effort for classifying most 



hypotheses is reasonable, but that the effort for a complete classification can be very large. 



Section 4.3 contains an empirical comparison of MMCTest to MCFDR by Sandve et al. (2011). 

The p-values used in this section are simulated from a mixture distribution consisting of a 
proportion ttq (the proportion of true null hypotheses) from a uniform distribution in [0, 1] and a 
remaining proportion 1 — ttq of p-values drawn from a Beta(0.25, 25) distribution. This mixture 



distribution was already used in Sandve et al. (2011). 

In all simulations, Algorithm [2] was used to compute confidence intervals in step 2.(b) of 
MMCTest. The batch size A in Algorithm [2] was increased geometrically by a factor of a = 1.25 
in every iteration, starting with A = 10. The spending sequence was % = (l — (1 — e) 1 /™) 
as given in Lemma [4] with e = 0.01 and r = 1000. Multiple testing was always performed at 
threshold a = 0.1. The parameter ttq will be given for every simulation. In the entire section, 



the Benjamini-Hochberg procedure h as defined in Section 2.1 has been used to control the 
FDR. The MCFDR procedure has one tuning parameter, which we set to the recommended value 
(h = 20). 

The effort of MMCTest is measured in terms of N, the total number of samples drawn during 
a run. 
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Figure 2: Empirical CDF of iV/(mlog(m) L5 ) for c = 3 and m £ {100, 1000, 10000}, where N 
denotes the total number of samples. 

4.1 Dependence of the effort on the number of hypotheses 

How does the number of samples N depend on the number of hypotheses? In the following 
simulationjthe distribution of N turns out to be roughly proportional to mlog(m) 3 / 2 . 

Figure [2] shows the empirical CDF of N / '(mlog(m) 3 / 2 ) for m G {100,1000,10000}. Each 
empirical CDF was computed based on 10000 runs. For each value of m, the algorithm was 
run until all but c = 3 hypotheses were classified. P-values were uniformly distributed in [0, 1] 
(ttq = 1). As the empirical CDFs match fairly well, Figure [2] indicates that the dependence of 
iV~ on the number of hypotheses m is roughly of the order mlog(m) 3 / 2 . 

4.2 Dependence of the effort on the stopping rule 

Figure [3] illustrates the convergence of the two sets A n and A n to A* in a single run of MMCTest, 
which we have shown formally in Theorem [2] P-values were generated from the mixture distri- 
bution described at the beginning of Section [4] with ttq = 0.9 and MMCTest was run until all but 
c = 10 hypotheses were classified. 

As most null hypotheses are true (ttq = 0.9) we expect a large proportion of hypotheses to be 
easily classified to lie above the Benjamini-Hochberg line. Indeed, the size of A n drops quickly 
during the first iterations. 

We use a relatively low threshold of a = 0.1 and only 10% of the hypotheses come from 
the alternative. This is why a considerable effort is needed to shrink the confidence intervals 
sufficiently for them to lie entirely below the Benjamini-Hochberg line. This becomes visible in 
Figure [3] as the size of A n remains unchanged over a large period of time and increases only at 
a later stage. 

Figure [4] shows the dependence of the effort on the number c of undecided hypotheses. To 
generate this figure, MMCTest has been applied 1000 times in the following way to m = 5000 
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Figure 3: Convergence of A n (dashed) and A n (solid) to A* for a sample run of MMCTest. Size 
of A n and A n as a function of the number of samples N drawn. The end of every iteration is 
indicated by a circle. Note the log-scale on the horizontal axis. 
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Figure 4: 0.5-, 0.95- and 0.99-quantile of the effort N against the number of unclassified hy- 
potheses c based on 1000 simulations. Log-scale on both axes. 
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Figure 5: Number of samples drawn until leaving the active set B n = A n \A n for each hypothesis 
with a p-value of rank 2000 to 5000. P- values were uniformly distributed in [0, 1]. 



uniformly distributed p- values in [0, 1]: The current size c of set A n \ A n and the current total 
number of samples N c were recorded in each iteration. If several p- values were classified together 
in an iteration, some c did not have a corresponding N c . To be conservative, a missing value 
N c is set to N c i for the largest d < c for which N c i is not missing. Each time the algorithm has 
been run until all but c = 10 hypotheses have been classified. 

The effort is reasonable for classifying all but a few hypotheses. Classifying the last few 
hypotheses seems to be computationally intensive. 

The steps in Figure [4] are caused by several hypotheses with p- values far off the Benjamini- 
Hochberg line being classified together. This effect is explored in Figure [5j which is based on a 
single run with m = 5000 uniformly distributed p- values in [0, 1]. A cross marks the number of 
samples drawn until each hypothesis leaves the active set B n = A n \ A n . In every iteration n 
only one additional sample was drawn for each hypothesis in B n , i.e. we used A = 1 and a = 1. 
MMCTest has been run until all but 2000 hypotheses were classified. Each horizontal line in the 
plot shows that several p- values with different ranks leave the active set at the same time. Many 
hypotheses can be classified after only a few samples. 



4.3 Comparison to MCFDR 



We now compare MMCTest to MCFDR by Sandve et al. (2011). Both algorithms were applied to 



the mixture distribution described at the beginning of Section [4] using proportions 7To of true 
null hypotheses ranging from to 1 in steps of 0.01. 



MCFDR uses hpc (see Section 3.3) by Pounds and Cheng (2006) to control the FDR. Thus 



hpc is used for MMCTest as well, using the procedure described at the end of Section 3.3 



Figure |6^a) shows the number of misclassifications for MCFDR in a single run for every value 
of 7To, obtained by comparing the result of hpc for estimated and exact p-values. Up to 150 
hypotheses are misclassified. The corresponding number of samples drawn by MCFDR is displayed 
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Figure 6: (a) Number of misclassifications for MCFDR based on a single run with 5000 hypotheses; 
(b) Number of samples used by MCFDR (crosses) and matched effort for MMCTest (dots); (c) 
Number of unclassified hypotheses for MMCTest when restricting effort as in (b); (d) Quantiles 
of the effort N of MMCTest when classifying all but 20 hypotheses (based on 100 replications). 
Log-scale on the vertical axis. 
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in plot (b). 

First, we restricted the number of samples drawn by MMCTest to the number drawn by MCFDR, 
see Figure [6^b) , and applied MMCTest to the same p- values used to evaluate MCFDR in Figure 
|6ja). Plot (c) then displays the resulting number of unclassified hypotheses by MMCTest. For 
ttq = almost all hypotheses are still undecided. This then declines until for ttq = 1 almost all 
hypotheses are classified. 

Next, we let MMCTest run until all but c = 20 hypotheses are classified. Figure[6^d) shows 5%, 
50% and 95% quantiles of the number of samples drawn by MMCTest based on 100 independent 
replications of drawing the p-values and running MMCTest. The number of samples needed to 
achieve this classification is mostly substantially higher than the number of samples needed 
by MCFDR. The main advantage of MMCTest is that its result is guaranteed to have with high 
probability no misclassifications as opposed to the large number of misclassifications than can 
occur when using MCFDR. 

Suppose we stopped the MMCTest procedure before the number of samples used by MCFDR was 
reached. Then, as mentioned before, many hypotheses will not be classified. An ad-hoc classifi- 
cation can be obtained by applying hpc to estimates pi = Si/ki of the p-values (see Algorithm 
|2]). This approach leads to a similar number of misclassifications as the MCFDR procedure. This 
illustrates that this variant of our procedure is similar in performance to the MCFDR procedure. 
We do not really recommend this variant; we recommend that whenever the algorithm is stopped 
one is only using the partioning of the hypotheses into rejected, not rejected and not classified 
hypotheses as result of the algorithm. 



5 Discussion 

We presented an open-ended sequential algorithm designed to implement multiple Monte-Carlo 
tests in the setting where p-values can only be approximated through simulation. 

The main feature of the MMCTest algorithm is that its output is guaranteed to be correct (with 
a pre-specified probability), meaning that all classifications are identical to the classifications 
based on the exact p-values. 

Our simulation study shows that a complete classification can be computationally expensive, 
but that most hypotheses can be classified using a reasonable effort. A detailed theoretical 
analysis of the computational effort of the proposed algorithm is outside the scope of the present 
article. 

This article identified conditions which guarantee the bounded risk of classification errors 
and the convergence of the algorithm's output to the classification computed with exact p- 
values. By verifying those conditions, we showed that the algorithm works for the procedure by 



Benjamini and Hochberg (1995) as well as for the Bonferroni correction (Bonferroni, 1936) and 
that it extends to a wider class of FDR controlling procedures. We conjecture that the MMCTest 
algorithm also works for other FDR controlling procedures and for procedures controlling FDR- 
related criteria (e.g. the False Non-Discovery Rate FNR). 



References 

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and 
powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 
(Methodological), 57(l):289-300. 



14 



Besag, J. and Clifford, P. (1991). Sequential Monte Carlo p-values. Biometrika, 78(2):301-304. 



Bonferroni, C. (1936). Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del 
R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3-62. 

Clopper, C. and Pearson, E. (1934). The use of confidence or fiducial limits illustrated in the 
case of the binomial. Biometrika, 26(4):404-413. 

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal 
of the American Statistical Association, 58(301):13-30. 

Jiao, S. and Zhang, S. (2010). A mixture model based approach for estimating the FDR in 
replicated microarray data. Journal of Biomedical Science and Engineering, 20(11):317-321. 

Lage-Castellanos, A., Martinez-Montes, E., Hernandez-Cabrera, J., and Galan, L. (2010). False 
discovery rate and permutation test: An evaluation in ERP data analysis. Statistics in 
Medicine, 29:63-74. 

Li, J., Tai, B., and Nott, D. (2009). Confidence interval for the bootstrap P- value and sample 
size calculation of the bootstrap test. Journal of Nonparametric Statistics, 21(5):649-661. 

Pekowska, A., Benoukraf, T., Ferrier, P., and Spicuglia, S. (2010). A unique H3K4me2 profile 
marks tissue-specific gene regulation. Genome Research, 20(11) :1493-1502. 

Pounds, S. and Cheng, C. (2006). Robust estimation of the false discovery rate. Bioinformatics, 
22(16):1979-1987. 

Sandve, C, Ferkingstad, E., and Nygard, S. (2011). Sequential Monte Carlo multiple testing. 
Bioinformatics, 27(23):3235-3241. 



A Appendix 

A.l Proofs of Section [2] 

Proof of Lemma^ 1. As all confidence intervals If are nested by assumption, the sequence 
(max(/™))i=i r ... m is decreasing. Thus by monotonicity of h, 

A n = 7t((max(If)) i=lj ..., m ) C / l ((max(/f +1 )) i= i > ... im ) = A n+1 . 

Similarly, A n D A n+ \ as (min(/f ))j=i ) ... i m is increasing. 

2. Given that pi 6 If \/i,n, the true p-values satisfy pi < max(If) \/i,n. When applied to 
the vectors (pi)j=i v .., m < ij^s^-{If))i=i,...,m-, the monotonicity of h yields 

^ = ^((Pi)i=i,..., m ) 5 /i((max(/f)) i= i 5 ... im ) =A n . 

Similarly, A* C A n as (p;)i=i,..., m > (min(/f ))i=i,..., m . □ 

Proof of Theorem^ We show that there exists no £ N such that for all n > no, 

M( mm/ f)ie{i,...,m}) = h{p*) = M( max/ ")ie{i,...,m})' 
To do this, we show h(p^) = /i(p^ +1 ^), j £ {1, . . . , 6}, where 
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P 



(1) .= 



,(2) •= 



(minJf)j 6{1) ... )m} , 
[mini™ i ^ B n , 
} mini™ i ^ A n , 



P 



(4) . = 



P 



(5) 



P 



(6) 



max/™ i G B n , 

P* * ^ #n, 

max/™ i G A n , 

P*i i i A n , 



and := (max If )i 6 {i ) ... jTO } with B n = A n \ A n . 

(1) Suppose 3i G lim sup^^^ B„. By Condition [2j |Jf | -»■ as n — )• oo. Let 5 be as given in 
the theorem. As B n C {1, . . . , m} is finite Vn G N, there exists no G N such that \If\ < 5 for 
n > uq and all z G limsup n _ 5>00 i? n . Hence, A n = h(p^>) = h(p( 2 ') Vn > no- 

(2) Asj/ 3 - 1 < (maxl™) ig { lj m } and as h is monotonic, A n C h(p^). Asp^ = mini™ < p* = 

p"P Vj G A„ andp^ = p^ 3 ' 1 Vj ^ A„, the second statement of Condition jlj yields h(p^) = h(p^) 
Vn > no- 

(3) As (min/| l )j 6 { lj , >m } < p^ and as h is monotonic, A n D h(p^). As p^ = p* > 

mini™ = p^ 3 "* Vj ^ A n and p^ = p^p Vj G ^4 n , the second statement of Condition jl] yields 
/i(p(3)) = /i(p(4)) = /j(p*) Vn > no- 

Arguing similarly to (1), (2), (3) we can show h{p^) = h(p®), h(pW) = h(p<&) and 
h(p^) = h(pW) = A n . □ 

Proof of Corollary^ By Theorem 2| we have A n — > A*, A n — > A* as n — > oo conditional on 
{p* G If Vi, n}. Under Condition p this event occurs with probability P(p* G If Vi, n) > 1 — e, 
hence F(A n -> A*,A n -»• A*) > 1 - e. □ 

Proof of Lemma^4\ First, we consider an individual Clopper- Pearson confidence interval If com- 
puted in step 2.(c) of Algorithm [2j To ease notation, we drop the indices i and n. 

We show that |/| < 2£, where £ = J ^ logp and p = (n^ — %_a)/2. The following 
probabilities are conditional on S 1 and k. 

Suppose S < k. Then the upper limit p u of the interval I is the solution to ¥(N < S\p = 
Pu) = Pi where N ~ Binomial(£;,p). If p > S7& + £ then, by Hoeffding's inequality ( |Hoeffding 



1963), 



r[N < s) = P _ E fAT) < | _ E (»)) < exp -ri"**. < „. 



£" V £" / £"* \ k* i i \ If* 

f\i \ / Ay \' L // \ Ay 

Thus p u < S /k + If S = k then p u = 1, implying p u < S/k + 

Similarly, it can be shown that the lower limit p/ of / satisfies pi > £/& — £. Hence, |/| = 
Pu-Pi < 2£. 

Now let (5 n ) ng N and Xy G {0,1}, j G N, be as given in Condition [2] and suppose i G 
limsup n _ s>00 B n . Then n — > oo implies fcj — )• oo by the construction of the algorithm. Further- 
more, the sequence rjk = ^ (l — (1 — e) 1 /™) satisfies r]k-f]k-\ ~ A; -2 , implying log(r]k-r]k-A) = 



o{k). As \If \ < 2y fogl^^fci — VK-a)) this implies |If | — > as n — > oo. This proves Condi- 
tion H 

Second, we show that Algorithm [2] computes confidence intervals in such a way that P(p* G 
7™ Vi, n) > 1 — e and thus satisfies Condition |3j 
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Let kf denote the value of k% in iteration n, where kf = 0, and let A n denote the value 
of A in iteration n. Algorithm [2] computes Clopper-Pearson confidence intervals If such that 
* ^ If) < f]k n — i]k n -A n - This then yields for % G {1, ... , m}\ 



HP* G If Vn) = 1 - P(U n {p* If}) >i-J2 F (p* i T i) 

71=1 

OO 

>i-E(^?-w-^) = ( 1 - e ) 1/m 



n=l 



As (Xij) are independent for i G {1, . . . , m} and j G N, the joint probability is P(p* G If Vi, n) = 
II™ i P (ft* G i? Vn) > 1 - e. Condition g is thus satisfied. □ 

A.2 Proofs of Section |3TH 

Proof of Theorem^ As /i is invariant to permutations, we may assume p\ < • • • < p m . 

[TJ Let p G [0, l] m and i G {1, . . . , m}. It suffices to show that h(p) D for any g G [0, l] m 
given by qj = pj Vj / i and % > pj. 

Let k := |/i(p)| be the largest index which is rejected when the Benjamini-Hochberg procedure 
is applied to p. We need to show that j £ h(q) Vj > k + 1. 

Case 1: r g (i) < fc. This implies r,j(j) = j Vj > + 1 and hence g>j = pj > ^ = Q7 1 . 
Therefore, j £ h(q) Vj > k + 1. 

Case 2: r 9 (z) > fc + 1. Let j > k + 1, j ^ i. Then the rank of the jth p- value can only drop by 
one when pi is replaced by i.e. r q (j) G {j — 1, j}. Thus qj = pj > ^ > ■ Furthermore, 
as r q (i) > k + 1, % takes the position of the former p r (i) in the ordered sequence of values from 

q, i.e. qi > p r M)- Hence, r q (i) ^ h(p) because of r q (i) > k + 1 and thus % > PrM) > ^ • 
Therefore, {fc + 1, . . . , m} U {i} £ h(q). This proves statement [TJ 

[2J All i ^ /i(p) satisfy pj > ^ whereas by assumption, qi < G fo(p). Hence, 

using qi = p^ Vi h(p), it follows that r q (i) = r p {i) Vi h(p). Thus, qi = pi > Vp( ^ a = Tq ^ a for 
all i ^ h(p). Hence h(p) c C h(q) c . 

Conversely, define g := maxjqj : i E h(p)}. As q < ^^^ Q < q$ for all i ^ h(p) and as 
there are exactly \h(p)\ values qi < q, the rank of q in q is precisely As qi < q < 

Vi G fa(p), all {qi}i£h(p) are rejected and /i(p) C h(q). This proves statement [2[ 

[3j As gj = pi for all i G h(p), have /i(p) C ft(g). 

Let i £ h(p). If r q (i) < r p (i), then ^ > %r p (i) > %r q (i). If r q (i) > r p (i), qi replaces a 
qj > ^r p (j) at rank r p (j) in the sorted sequence of q, hence r q (i) = r p (j) and % > qj > ^f p (j) = 
^T q (i). Thus qi > B,f q {i) Vi ^ ^(p), which implies /i(p) c C h(q) c . This proves statement [3| □ 



Proof of Corollary^ The first statement of Condition [TJ is satisfied as h is monotonic by The- 
orem [5j 

To prove that the Benjamini-Hochberg procedure h also satisfies the second statement of 
Condition [TJ it suffices to show that for p,q G [0, l] m , both qi < pi Vi G h(p) and qi = pi 
Vi ^ h(p) as well as qi = pi Vi G /i(p) and % > p» Vi ^ /i(p) imply /i(p) = /i(g). 

Indeed, let p,q G [0, l] m be such that < pi Vi G /i(p) and = p« Vi ^ /i(p)- We have 
Pi < ^ ^ ^ ^ definition of h, thus qi < Pi < Vi G /i(p) and /i(p) = h(q) by 

statement [2] of Theorem [5j 
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Similarly, let p,q E [0, l] m be such that qi = pi Vi E /i(p) and % > Pi Vi ^ ^(p). Using 
that pi > ^r p (i) Vi ^ /i(p) it immediately follows that qi > pi > ^r p (i) Vi £ h(p) and thus 
fr(p) = ^(9) by statement [3] of Theorem [5j □ 

Proof of Lemma [?} The result of h remains unchanged if all p- values do not change their rank 
outside of a tie and if no p- value crosses the Benjamini-Hochberg threshold line. 
As h is invariant to permutations, we may assume p* < • • • < p* m . 

Let 5 := min ^ j P ' g 1-1 : i = 2, . . . , m with < p*j U {\p* — ia/m\ : i = 1, . . . ,m}j. 

LetpE [0, l] m with \\p - p*\\ < 5. Thenp*_ x < p* imphes pj_i < p|_ x +5 < P*-o~ < Pi- Thus, 
by possibly permuting indices corresponding to tied values in p, we may assume p* < ■ ■ ■ < p* m 
and pi < • • • < p m . The ranks of the p- values in p* and p are therefore the same. 

Furthermore, \p% — p*\ < 5 < \p* — ia/m\ for all i € {1, ... , m}, implying that p* and pi lie 
on the same side of the Benjamini-Hochberg line. Hence, h{p*) = h(p). □ 

A.3 Proofs of Section 13721 

Proof of Theorem^ [TJ Let p E [0, l] m and i E {1, . . . , m}. It suffices to show that h B (p) 5 
h B (q) for any q E [0, l] m given by (/j = pj Vj 7^ i and qi > pi. 

If Pi > a/m, then i is non-rejected. Increasing pi even further will thus not change the result 
of h B as qi > pi > a/m. Therefore, h B (q) = h B {p). 

If Pi < a/m and > Pi, the result of h B (q) depends on whether % is greater than a/m 
or not. In the first case, i is non-rejected in h B (q), thus h B {q) C h B {p). In the second case, 
qi < a/m is still rejected and thus h B {q) = h B {p). This proves statement [TJ 

[2j All % < ^, i E /ifi(p), are rejected, thus h B (p) Q h B (q). Similarly, all % > ^, i ^ h B (p), 
are non-rejected, thus h B (p) c C h B (q) c . This proves statement [2j □ 

Proof of Corollary^ The first statement of Condition [T] is satisfied as h is monotonic by The- 
orem |HJ 

Statement [2] of Theorem [8] shows that the Bonferroni correction also satisfies the second 
statement of Condition [TJ Indeed, let p, q E [0, l] m be given such that qi < pi Vi E /i(p) and 
<7i > Pi Vi ^ /i(p). By definition of Pi < ^ Vi £ h B (p) and p» > ^ Vi ^ h B (p). In particular, 
% < Pi < ^ Vi E /ib(p) and % > Pi > ^ Vi ^ /is(p), thus /is(p) = /15(g) by statement [2] of 
Theorem H □ 



Proof of Lemma 10. Let <5 := min^^ m } |p* — a/m|. Let p E [0, l] m with ||p — p* \\ < 5. For all 
i E {1, . . . , m}, this implies that pi and p* lie on the same side of the threshold a/m. Therefore, 
h B (p*) = h B (p). □ 

A.4 Proofs of Section [3731 



Proof of Lemma 11. Let i E /i(p, «i). Thus, 3j : r p (j) > r p (i) and m ^rjjj < «i < «2 by 



definition of h, implying i E h(p, 02) and /i(p, ai) C h(p, 02). □ 
Proof of Corollary\lS\ For p < q, the estimator iro satisfies 



m 

i=l 




7f (p) = min pi < min (l, — J^ft ) =7ro(g)- 
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Therefore, as ~ a , \ > 



hpcip) = h[p, -—r^ ) ^hlq, ] Dhlq, — ^ ) = /ipc(<?) 



by statement [T] of Theorem [5] and Lemma 11 □ 
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